Ecobici is a government sponsored program to encourage people to use bycicles as a means of transportation in Mexico City. Users of the program pay an annual fee to loan a bike at any time for a maximum duration of 45 minutes. There are bike stations in different parts of the city and one can take a bike on any station and return it at any other.
While the program has enjoyed good general reception since it was launched in 2010, there is still a lot of room for increased adoption. In order to create a successful expansion plan, one must gather knowledge about the factors that seem to influence adoption. One way to do that is by understanding the current state of adoption faceted by different traits, such as gender, age, location, etc.
For this project, we’ll attempt to answer the following questions:
Any other questions that arise during the exploration we’ll be analyzed too.
The dataset we’ll be analyzing in this project is that of users of the Ecobici program from February 15, 2010 to December 31, 2013.
The columns of the dataset include:
(You’ll notice that in the registration.type variable, there’s one value called “ALTA TELMEX”. Telmex, in case you didn’t know, is the largest phone company in Mexico, and has partnered with Mexico City’s government to allow payment of the annual fee for Ecobici through a user’s phone bill.)
We’ll also explore some variables that are not included in the previous list, but which can be easily derived from the original dataset:
Before we begin, we’ll go through a couple of steps to put the dataset in a form more suitable for analysis.
Notice that we’ve explicitily specified that we don’t want to read strings as factors. This is because we don’t want dates to be turned into factors. We also want to strip whitespace so as to avoid additional spurious factor levels later on:
# Set your cdw to the source file location
users <- read.csv('ecobici_usuarios.csv',
header=T, stringsAsFactors=F, strip.white=T)
Our dataset contains 111935 entries of 11 variables, and by running the str command, we can see that all columns are named in Spanish:
str(users)
## 'data.frame': 111935 obs. of 11 variables:
## $ USUARIO : int 145 669 856 865 26538 956 28702 901 990 980 ...
## $ TARJETA : chr "2938835630" "2938833614" "2861480206" "2935342734" ...
## $ SEXO : chr "M" "M" "M" "M" ...
## $ FECHA.DE.NACIMIENTO : chr "1964-11-19" "1978-08-14" "1979-04-01" "1978-09-07" ...
## $ COLONIA : chr "El Prado" "Zona Escolar" "Condesa" "Cuauhtémoc" ...
## $ DELEGACION : chr "Coyoacán" "Gustavo A. Madero" "Cuauhtémoc" "Cuauhtémoc" ...
## $ ESTADO : chr "D.F." "D.F." "D.F." "D.F." ...
## $ MEDIO.DE.INSCRIPCION: chr "ALTA WEB" "ALTA WEB" "ALTA" "ALTA WEB" ...
## $ FECHA.DE.INSCRIPCION: chr "2010-02-16" "2010-02-18" "2010-02-19" "2010-02-19" ...
## $ USOS : int 0 0 0 0 0 0 0 0 0 0 ...
## $ STATUS : chr "Vigente" "Vigente" "Vigente" "Vigente" ...
So we’ll translate them to English to make it easier for everyone to understand what each variable stands for:
names(users) <- c(
"user.id", "card.id", "gender", "birthday", "borough", "municipality",
"state","registration.type", "registration.date", "rides.count", "status")
names(users)
## [1] "user.id" "card.id" "gender"
## [4] "birthday" "borough" "municipality"
## [7] "state" "registration.type" "registration.date"
## [10] "rides.count" "status"
Some of our categorical variables also have names in Spanish. We’ll translate those as well in the next step.
head(users$registration.type)
## [1] "ALTA WEB" "ALTA WEB" "ALTA" "ALTA WEB" "ALTA" "ALTA"
head(users$status)
## [1] "Vigente" "Vigente" "Vigente" "Vigente" "Inactivo" "Vigente"
Our next step is to do a little bit of cleaning and parsing as follows:
# If regexps are Greek to you, check out this guide:
# http://www.regular-expressions.info/rlanguage.html
#
# You can also check ?regex for built-in help in RStudio
whitespace_to_na <- function(str) gsub("^[[:space:]]*$", NA, str)
# Convert blank entries to NA, but only for char columns
char_cols <- sapply(users, is.character)
users[char_cols] <- lapply(users[char_cols], whitespace_to_na)
users[char_cols] <- lapply(users[char_cols], tolower)
users <- na.omit(users)
# Parse date columns (dates are in the year-month-day format, hence ymd)
library(lubridate)
date_cols <- c("birthday", "registration.date")
users[date_cols] <- lapply(users[date_cols], ymd)
# Convert all other columns to factors
cols <- !names(users) %in%
c("rides.count", "user.id", "birthday", "registration.date")
users[cols] <- lapply(users[cols], factor)
# Once a column is a factor, renaming their values is trivial using levels(...)
levels(users$registration.type) = c("normal", "telmex", "web")
levels(users$status) = c("inactive", "active")
levels(users$gender) = c("female", "male")
Let’s make sure that the renaming of categorical variables went well:
head(users[, c("registration.type", "status", "gender")])
## registration.type status gender
## 1 web active male
## 2 web active male
## 3 normal active male
## 4 web active male
## 6 normal active female
## 7 normal inactive male
Now we’ll extract some extra variables. We’ll begin with tenure (the number of days since the user registered):
reference_day <- ymd("2013-12-31")
users$tenure <- with(users, as.period(
new_interval(registration.date, reference_day), units = "days")$day)
Then we’ll add age:
users$age <-
with(users, as.period(reference_day - birthday, units = "years")$year)
As is usual with datasets in the real world, there’s some bad data that we need to filter out. For our dataset, it turns out that there are 26 people who were born in the future or were riding an adult bike at age 3 or less!
dim(subset(users, age <= 3))
## [1] 26 13
So, let’s remove them:
# Filter out users who claim to have been born today or in the future
users <- subset(users, age >= 4)
You might have noticed that we used a different method for the age variable than we did for tenure. This is because, for age, using new_interval throws an error which appears to be due to a bug in the lubridate package.
OK, so it looks like we’re now all set for our exploratory data analysis. Let’s get started!
Throughout the analysis we’ll be using a couple of functions to reduce the amount of coding we need to do in order to create some plots.
library(gridExtra)
library(ggplot2)
library(lazyeval)
library(dplyr)
library(tidyr)
median_color <- "darkgoldenrod2"
mean_color <- "darkgoldenrod4"
percentile_color <- "chartreuse3"
# Unlike the functions of the ggplot package that accept symbols as variable
# names, all of the following functions
make_histogram <- function (data, var, binwidth=NA, drop=F) {
ggplot(aes_string(x = var), data = data) +
make_histogram_layer(binwidth = binwidth, drop = drop)
}
make_histogram_by <- function (data, var, facet, binwidth=NA, drop=F) {
ggplot(aes_string(x = var, fill = facet), data = data) +
make_histogram_layer(binwidth = binwidth, drop = drop)
}
make_decorated_histogram <- function(data, var, binwidth=NA) {
stats <- make_stats_summary(data, var)
data %>%
make_histogram(var, binwidth = binwidth) %>%
decorate_with_stats(stats)
}
make_decorated_histogram_by <-
function(data, var, facet, binwidth=NA, scales="fixed") {
stats <- make_stats_summary_by(data, var, facet)
data %>%
make_histogram(var, binwidth = binwidth) %>%
decorate_with_stats(stats) +
facet_wrap(interp(~x, x = as.name(facet)), ncol = 1, scales = scales)
}
make_ordered_histogram <- function(data, var, limit=10, decreasing=F) {
top <- head(sort(table(data[,var]), decreasing = decreasing), limit)
top <- as.data.frame(top)
top[,var] <- rownames(top)
names(top) <- c("count", "name")
rownames(top) <- 1:nrow(top)
# change the levels to the sorted names so ggplot will plot them in sorted
# order
top$name <- factor(top$name, levels = top$name)
ggplot(aes(x = name, y = count), data = top) +
geom_bar(stat = 'identity') +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
}
# This method is designed for categorical values that may have too many values to be
# comfortably displayed on the screen, so that they require some trimming after a
# certain limit
make_ordered_histogram_by <- function(data, var, facet, limit=10, decreasing=F, relative=F) {
# TODO: figure out how to concisely switch the sort order using the "decreasing" param
grouped_data <- users %>%
group_by_(facet, var) %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
ungroup()
# remove the tbl_df wrapper that the dplyr functions add so that grouped_data[,var]
# returns a vector
grouped_data <- as.data.frame(grouped_data)
# this ensures that the plot will preserve the data in the right order
grouped_data[,var] <- factor(grouped_data[,var], levels = unique(grouped_data[,var]))
facet_levels <- levels(grouped_data[,facet])
# switch to wide format so we can get the top N * LIMIT rows in the dataset,
# where N is the number of levels in the facet variable
grouped_data <- grouped_data %>%
spread_(facet, "count") %>%
head(limit) %>%
gather_(facet, "count", facet_levels)
position <- ifelse(relative, "fill", "stack")
ggplot(aes_string(x = var, y = "count", fill = facet), data = grouped_data) +
geom_histogram(stat = 'identity', position = position) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
}
default_x_scale <- function(var, step=10) {
scale_x_continuous(breaks = seq(min(var), max(var), step))
}
default_y_scale <- function(var, step=10) {
scale_y_continuous(breaks = seq(min(var), max(var), step))
}
make_histogram_layer <- function(binwidth, drop=F) {
layer <- geom_histogram()
if (!is.na(binwidth))
layer <- geom_histogram(binwidth = binwidth, drop = drop)
layer
}
median_line <- function (stats_data)
geom_vline(aes(xintercept = median), stats_data, color = median_color)
mean_line <- function (stats_data)
geom_vline(aes(xintercept = mean), stats_data, color = mean_color)
q1_line <- function(stats_data)
geom_vline(aes(xintercept = q1), stats_data, color = percentile_color)
q3_line <- function(stats_data)
geom_vline(aes(xintercept = q3), stats_data, color = percentile_color)
generate_stats_summary_exprs <- function(var) {
# see the comment in "make_stats_summary" for details on this somewhat
# complicated function
var <- as.name(var)
exprs <- list(interp(~mean(x), x = var),
interp(~median(x), x = var),
interp(~as.numeric(quantile(x, 0.25)), x = var),
interp(~as.numeric(quantile(x, 0.75)), x = var))
list (vars = c("mean", "median", "q1", "q3"), exprs = exprs)
}
make_stats_summary <- function(data, var) {
# Unfortunately, "summarise_" doesn't support string literals directly
# The workaround is to generate expressions from strings.
#
# For details, see:
# http://cran.r-project.org/web/packages/dplyr/vignettes/nse.html
with(generate_stats_summary_exprs(var),
summarise_(data, .dots = setNames(exprs, vars)))
}
make_stats_summary_by <- function(data, var, facet) {
grouped_data <- group_by_(data, facet)
# See the comment in "make_stats_summary"
with(generate_stats_summary_exprs(var),
summarise_(grouped_data, .dots = setNames(exprs, vars)))
}
decorate_with_stats <- function(plot, stats) {
plot +
median_line(stats) +
mean_line(stats) +
q1_line(stats) +
q3_line(stats)
}
bucket <- function(x, bucket_size) {
round(x / bucket_size) * bucket_size
}
To get a bird’s-eye overview of our dataset, let’s create a scatterplot matrix:
library(GGally)
users.reduced <- subset(users, status == 'active', select=-c(user.id, card.id, state, borough, municipality, registration.date, birthday))
set.seed(20022012)
users_sample <- users.reduced[sample(1:nrow(users.reduced), 10000), ]
ggpairs(users_sample, params = c(shape = I('.'), outlier.shape = I('.')))
Some of the resulting plots are very easy to interpret (e.g. histograms of categorical variables), while others, such as the histogram and density plots of rides.count require further investigation to understand completly. In the following sections, we’ll try to understand such plots better.
Let’s begin by taking a look at summary of the variables of interest, one by one in order to avoid the crammed output that doing summary(users) would give us:
summary(users$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.00 28.00 33.00 35.29 41.00 83.00
summary(users$tenure)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 191.0 386.0 544.9 1015.0 1415.0
summary(users$rides.count)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 6.00 34.92 38.00 1254.00
summary(users$gender)
## female male
## 42311 68585
summary(users$status)
## inactive active
## 13216 97680
summary(subset(users, select=c(state, municipality, borough)))
## state municipality borough
## d.f. :95822 cuauhtémoc :35743 roma norte : 6124
## edo. mex.:14843 miguel hidalgo :19675 condesa : 5203
## hidalgo : 28 benito juárez : 8857 cuauhtémoc : 4617
## morelos : 26 gustavo a. madero: 5800 juárez : 2731
## querétaro: 21 álvaro obregón : 4916 hipódromo : 2557
## jalisco : 20 coyoacán : 3930 polanco v sección: 2444
## (Other) : 136 (Other) :31975 (Other) :87220
summary(users$registration.type)
## normal telmex web
## 105158 977 4761
summary(subset(users, select=c(birthday, registration.date)))
## birthday registration.date
## Min. :1930-05-30 00:00:00 Min. :2010-02-15 00:00:00
## 1st Qu.:1972-08-17 00:00:00 1st Qu.:2011-03-22 00:00:00
## Median :1980-08-12 00:00:00 Median :2012-12-10 00:00:00
## Mean :1978-03-18 01:14:23 Mean :2012-07-04 01:26:28
## 3rd Qu.:1985-09-09 00:00:00 3rd Qu.:2013-06-23 00:00:00
## Max. :2000-04-05 00:00:00 Max. :2013-12-31 00:00:00
Some facts we can see right away include:
The number of registered female users (42311) versus male users (68585)
The oldest birthday (May 30 1930)
The states with most users: D.F (Federal District) with 95822 users, followed by Edo. Mex. (State of Mexico) with 14843 users
The municipalities with most users: Cuauhtémoc with 35743 users and Miguel Hidalgo with 19675 users (which comes as no surprise, given that most bike stations are in those areas.)
The range of ages goes from 13 to 83 years old
The number of active users (97680) vs the number of inactive users (13216)
We’ll explore these facts and others in much more detail in the following sections.
One of the original questions posed at the beginning of this project was: Does age play a role in the adoption of Ecobici? The following histogram may give us some insight into the answer:
make_histogram(users, "age", binwidth = 1)
A couple of things are immediately obvious: there are no gaps in the range of ages, the distribution is unimodal and skewed to the left, the mode appears to be around 30, and the outline of the histogram is kind of smooth (i.e. the function that relates adoption of the program and age of the users seems to be smooth.)
The skewness of the distribution is most likely due to the fact that Ecobici requires a credit card or phone bill invoice for registration. Under age people can still register, but their parents need to accompany them.
Let’s add some more detail to the graph to investigate the mean and median, as well as the interquartile range (IQR):
make_decorated_histogram(users, "age", binwidth = 1) +
default_x_scale(users$age, step = 5)
The median seems to be around 33, with the mean at around 35, a consequence of the longer right tail of the distribution. The mode, which represents the largest age group in the program, is right at 28 years old.
Let’s now compare the female and male populations in a frequency polygon plot:
ggplot(aes(x = age), data = users) +
geom_freqpoly(aes(color = gender)) +
default_x_scale(users$age, step = 5)
It’s clear that the distributions are somewhat similar in shape but men dominate in number. To see this more clearly, let’s draw them in different plots in the same column, along with their mean, median and IQR:
make_decorated_histogram_by(users, "age", "gender", binwidth = 1, scales = "free_y") +
default_x_scale(users$age, step = 2)
Now it’s easier to see that the male population tends to be a bit older than the female population. Also, it seems like the variance of the male population is greater. Let’s crunch the numbers to confirm or deny this perception:
by(users$age, users$gender, sd)
## users$gender: female
## [1] 9.747954
## --------------------------------------------------------
## users$gender: male
## [1] 10.53073
And, indeed, the male population has a greater variance, but only by a small margin.
Finally, let’s create a boxplot to get another look at the same data:
ggplot(aes(x = gender, y = age), data = users) +
geom_boxplot() +
default_y_scale(users$age, step = 5)
One piece of information that the boxplot gives us is the age at which men or women start being considered rare in their respective populations. Notice that in the case of men, we see the first outliers after age 63, while in the case of women it is after age 58. This supports our hypothesis that the male population tends to be older.
Let’s now explore the distribution partitioned by type of registration:
make_histogram_by(users, "age", "registration.type", binwidth = 1) +
default_x_scale(users$age, step=5)
What the graph shows is how an overwhelming majority of users signup using the traditional means (as opposed to signing up on the web or using Telmex).
Let’s now compare the three populations on a free scale with a more elaborate histogram:
make_decorated_histogram_by(users, "age", "registration.type", binwidth = 1,
scales = "free_y") +
default_x_scale(users$age, step=5)
Clearly, the Telmex population seems to be much younger, though it’s unclear why this might be. One possible explanation is that younger people are less likely to have a credit card, so they may be using their phone bill (or their parents’) to join the program.
Running a summary by type of registration, we confirm that the Telmex population does tend to be younger:
by(users$age, users$registration.type, summary)
## users$registration.type: normal
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.00 28.00 33.00 35.27 41.00 83.00
## --------------------------------------------------------
## users$registration.type: telmex
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 16.00 24.00 29.00 32.35 39.00 75.00
## --------------------------------------------------------
## users$registration.type: web
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 16.00 29.00 34.00 36.43 42.00 79.00
It also seems that the Telmex population has greater variance. Crunching the numbers we can see whether this is true:
by(users$age, users$registration.type, sd)
## users$registration.type: normal
## [1] 10.28671
## --------------------------------------------------------
## users$registration.type: telmex
## [1] 11.24539
## --------------------------------------------------------
## users$registration.type: web
## [1] 9.996479
Finally, let’s make a plot where we can see the distribution of age by gender and type of registration to see whether women or men have a preference in the way they sign up to the program:
make_histogram_by(users, "age", "gender", binwidth = 1) +
facet_wrap(~registration.type, ncol = 1, scales = "free_y")
There appears to be no significant difference in the distributions of men and women in either the normal or web populations, but it could be argued that there’s something going on in the case of Telmex where the variations in the distributions of men and women are less in sync than in the other populations. This is an intriguing finding, but we have no further data to make a hypothesis about it.
Let’s now proceed to analyze the number of rides for users:
summary(users$rides.count)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 6.00 34.92 38.00 1254.00
make_histogram(users, "rides.count", binwidth=10)
It looks like we have a heavily skewed distribution, with a very thin and long tail. To get a better visualization, we’ll apply a scale transformation on the x axis:
make_histogram(users, "rides.count") +
scale_x_log10(breaks = c(1, 2, 3, 5, 10, 20, 30, 50, 100, 200, 300, 500, 1000))
There’s an important and very evident finding here: a rather large number of people signed up for the program, used it a couple of times and then stopped using it. Let’s see the exact numbers:
distribution <- table(users$rides.count)
head(distribution, 10)
##
## 0 1 2 3 4 5 6 7 8 9
## 40280 3857 3403 2809 2444 2195 1866 1754 1587 1502
tail(distribution, 10)
##
## 838 872 878 888 909 915 944 946 1077 1254
## 1 1 1 1 1 1 1 1 1 1
There is a very large number of people with no rides. Could this be bad data or is it true that almost 37% of the users signed up for the program but then never actually used it at all?
Let’s add more information to the previous plot:
make_decorated_histogram(users, "rides.count") +
scale_x_log10(breaks = c(1, 2, 3, 5, 10, 20, 30, 50, 100, 200, 300, 500, 1000))
The median number of rides appears to be 6. Let’s confirm the numbers analytically:
summary(users$rides.count)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 6.00 34.92 38.00 1254.00
This has really caught our attention. Let’s see what proportion of the total number of users have done less than the median number of rides:
m <- with(users, median(rides.count))
nrow(subset(users, rides.count <= m)) / nrow(users)
## [1] 0.5126785
This number is quite surprising, around 51% of users are people who have just tried the program a couple of times! Certainly not what we might call “active users”. This is an important finding.
Now, let’s analyze whether this phenomenon appears regardless of gender:
make_decorated_histogram_by(users, "rides.count", "gender") +
scale_x_log10(breaks = c(1, 2, 3, 5, 10, 20, 30, 50, 100, 200, 300, 500, 1000))
In the case of women, the phenomenon seems to be much worse with a median of 3 rides, which we can confirm analytically:
by(users$rides.count, users$gender, summary)
## users$gender: female
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 3.00 25.38 26.00 944.00
## --------------------------------------------------------
## users$gender: male
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 8.00 40.81 48.00 1254.00
by(users$rides.count, users$gender, IQR)
## users$gender: female
## [1] 26
## --------------------------------------------------------
## users$gender: male
## [1] 48
These last numbers show something surprising: women have a median of 3 rides, while men have a median of 8 rides. The means are higher because of outliers in both groups that drag them up to 40.8117518 in the case of men, and 25.3770414 in the case of women.
Let’s the see the proportion of women (relative only to the female population) who have taken a number of rides equal to or less than the median:
women <- subset(users, gender == 'female')
m <- with(women, median(rides.count))
nrow(subset(women, rides.count <= m)) / nrow(women)
## [1] 0.5013117
Despite the lower median, the proportion of “active users” (informally defined here as those who have taken more than the median number of trips in their population) for women is actually quite close to that of the general population.
However, if we just concentrate on the values below the median, and we look at the proportions, rather than the full counts, it’s immediately obvious that the phenomenon is a bit more pronounced in the case of women:
ggplot(aes(x = rides.count, fill = gender), data = users) +
geom_histogram(aes(y = ..density..), position = "fill") +
scale_x_log10(breaks = c(1, 2, 3, 5, 10, 20, 30, 50, 100, 200, 300, 500, 1000)) +
geom_hline(aes(yintercept = 0.5), color = 'black')
From this last plot, there’s no doubt that the most devoted “active” users of Ecobici are men (people with over a 1000 rides.) Also interesting is that despite the downward trend after about 25 rides for women, they make an important comeback when getting near 1000 with a proportion of around 0.30
Finally, let’s see what additional information we can glean from a boxplot:
ggplot(aes(x = gender), data = users) +
geom_boxplot(aes(y = rides.count, color = gender, notch = T))
ggplot(aes(x = gender), data = users) +
geom_boxplot(aes(y = rides.count, color = gender)) +
scale_y_sqrt(breaks = c(1, 2, 3, 5, 10, 20, 30, 50, 100, 200, 300, 500, 1000))
Well, it seems that women with over a 150 rides are quite rare as far as their population goes, while men need to surpass the 300 rides to be considered rare.
We’ll now take a look at tenure, the number of days a user has been in the program:
with(users, summary(tenure))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 191.0 386.0 544.9 1015.0 1415.0
make_histogram(users, "tenure") +
default_x_scale(users$tenure, step = 100)
The distribution shown by the previous plot tells us at least three things:
Most people using Ecobici joined relatively recently (within the last year and a half or so).
The second biggest group of users seems to correspond to the early adopters back in 2010 when Ecobici was first launched.
Something happened after the second year of operation that caused a diminished number of new users. Maybe there was just not enough advertising or only a limited number of bike stations was available, thus limiting the number of potential users. However, that clearly changed during the third year of its operation.
Let’s change the binwidth of the histogram to 1 to see if we notice anything unusual:
make_histogram(users, "tenure", binwidth = 1) +
default_x_scale(users$tenure, step = 100)
The overall shape of the distribution seems the same, but we do see two or three very distinct points with well over 400 counts. We’re not sure what these might represent, but they don’t seem to warrant much attention.
Another comparison we can make is whether the Federal District and the State of Mexico have the same tenure distributions. Let’s find out:
make_decorated_histogram_by(subset(users, state == 'd.f.' | state == 'edo. mex.'),
"tenure", "state", binwidth = 10, scales = "free_y") +
default_x_scale(users$tenure, step = 100)
The overall shapes of the distributions are quite similar, but the State of Mexico does have a lower median and mean, and generally lower amounts of people with tenures between 290 and 500 days. This means that for some reason after its first year of operation, the number of new Ecobici users coming from the State of Mexico declined much more rapidly than its counterpart in the Federal District.
Finally, let’s see whether faceting by gender shows anything interesting:
ggplot(aes(x = tenure), data = users) +
geom_freqpoly(aes(color = gender)) +
default_x_scale(users$tenure, step = 100)
The distributions are really similar in shape, so the answer appears to be no.
Our dataset contains three variables related to location. We’re interested in seeing which locations have the most users. For the case of states there’s really no point in plotting in detail, as the Federal District and the State of Mexico dwarf all other states:
make_histogram(users, "state") +
theme(axis.text.x = element_text(angle=90))
In the case of municipalities, we’ll focus on those in the Federal District:
df_users <- subset(users, state == 'd.f.')
make_histogram_by(df_users, "municipality", "gender", drop = T) +
scale_y_log10() +
theme(axis.text.x = element_text(angle=90))
For someone who lives in the Federal District, it’s obvious that there’s more municipalities in this graph than there should be. The Federal District is divided into 16 regions (known in Spanish as Delegaciones), and this graph shows way more regions. Taking a cursory look at the regions displayed, it appears that some of them are part of the State of Mexico. Unfortunately, there’s many of them obscuring those that we care about.
So it seems like we’ve hit a dead-end here, as a proper analysis of users by state and municipality requires the data to be correct. One thing we could do is to ensure consistency between municipalities and states, but doing so requires an official listing of municipalities per state (a “golden standard”), as well as applying some fuzzy matching techniques to ensure all municipalities have a single unique name (as it stands right now, it’s quite possible that the same municipality name was input using just slightly different names, due to spelling errors, data corruption, lack of accents, etc.)
Nevertheless, it’s likely that, even if states and municipalities are not matched up correctly in many cases, the rest of the data is fine (i.e. people did write correctly their municipality and borough of residence), so we can still plot a histogram including the municipalities and boroughs with most registered users.
We’ll do just that by sorting the regions and making them an ordered factor before we plot (you can see the details in the Common Functions section above):
make_ordered_histogram(users, "municipality", limit = 15, decreasing = T)
Clearly, the distribution is really far from uniform, with just two or three municipalities leading the charts.
Let’s do the same plot for boroughs now:
make_ordered_histogram(users, "borough", limit = 15, decreasing = T)
We can also see if there are places, amongst those with most users, where the proportion between men and women deviates significantly from that of the general population:
make_ordered_histogram_by(users, "municipality", "gender", limit = 15,
decreasing = T, relative = T) +
geom_hline(aes(yintercept = 0.5), linetype="longdash") +
geom_hline(aes(yintercept = 0.35), linetype="longdash") +
scale_y_continuous(breaks = seq(0, 1, 0.1))
In general, men seem to be majority, but never exceeding 65%. Let’s see whether we can find a place where this exceeds that by including more boroughs in our plot:
make_ordered_histogram_by(users, "borough", "gender", limit = 50,
decreasing = T, relative=T) +
geom_hline(aes(yintercept = 0.5), linetype="longdash") +
geom_hline(aes(yintercept = 0.31), linetype="longdash") +
scale_y_continuous(breaks = seq(0, 1, 0.1))
Here we notice for the first time a significant variation: the borough “Centro (Area 1)” (part of Downtown Mexico City) does have a proportion of men close to 69%, the biggest thus far. This is somewhat surprising, as intuitively one would expect that region to have a more balanced proportion.
Having become acquainted with the individual variables of the dataset, we’ll now proceed to look at the relationships between continuous variables in it.
For some of the plots involving the whole dataset, as opposed to summaries, we’ll be using a sample of 30,000 rows to the make our plotting faster.
set.seed(7670767)
users_sample <- users[sample(1:nrow(users), 40000), ]
We’ll begin by looking at the relationship between age and tenure. We wonder who tends to stick to the program longer, young people or older people:
library(RColorBrewer)
ggplot(aes(x = age, y = tenure), data = users_sample) +
geom_point(aes(color = gender)) +
default_x_scale(users_sample$age, step = 5) +
scale_color_brewer(palette = "Set1")
Well, other than showing the grouping of people in three large clusters of tenure, the graph doesn’t really tell us much more. Let’s add a bit of jitter and some alpha to the points to see if we uncover anything about the distribution of points:
ggplot(aes(x = age, y = tenure), data = users_sample) +
geom_point(aes(color = gender), alpha = 1/5, position = 'jitter') +
default_x_scale(users_sample$age, step = 5) +
scale_color_brewer(palette = "Set1")
That’s just slightly better, but it doesn’t give us new information, all we see is things that we knew from the univariate analysis already (e.g. most of the population is young and middle age adults.)
There’s just too much data on the plot, so let’s try using a summary instead. We’ll add both the median and mean to the graph so we can compare, but we’ll use the median as a more representative measure of the population as a whole because it’s a more robust statistic in the presence of outliers:
ggplot(aes(x = age, y = tenure), data = users) +
geom_line(stat = 'summary', fun.y = median, color = median_color) +
geom_line(stat = 'summary', fun.y = mean, color = mean_color) +
default_x_scale(users_sample$age, step = 5)
Now, this is much better. We can see a trend more clearly. Let’s add a smoother to make it even more obvious:
users_summary <- users %>%
group_by(age, gender) %>%
summarise(tenure.mean = mean(tenure),
tenure.median = median(tenure)) %>%
ungroup()
p1 <- ggplot(aes(x = age, y = tenure.median), data = users_summary) +
geom_line(color = median_color) +
geom_smooth() +
default_x_scale(users$age, step = 5)
p2 <- ggplot(aes(x = age, y = tenure.mean), data = users_summary) +
geom_line(color = mean_color) +
geom_smooth() +
default_x_scale(users$age, step = 5)
grid.arrange(p1, p2, ncol = 2)
The story that this plot tells us is that the early adopters of Ecobici were seniors. The rest of the population came on board later on, thus having less tenure. Also, the bowl-shaped curve of the regression line tells us that there’s been a wave of young adults (around 30 years old) adopting the program more recently.
And now let’s facet it by gender and registration kind to see any possible variations in the distributions of those subpopulations:
users_summary <- users %>%
group_by(age, gender, registration.type) %>%
summarise(tenure.mean = mean(tenure),
tenure.median = median(tenure)) %>%
ungroup()
ggplot(aes(x = age, y = tenure.median / 365), data = users_summary) +
geom_line(color = median_color) +
geom_smooth() +
default_x_scale(users$age, step = 5) +
facet_grid(gender ~ registration.type)
Now, this is a very interesting plot, because it tells very different stories for people who registered online versus those who did it using Telmex or in the traditional way. For example, for the web subgroup, it appears as though more elderly women starting at age 63 have continued to join the program after the first year of operation, in contrast to elderly males who have stopped joining in more recent years.
However, it’s very important to keep in mind the smaller relative sizes of the web and Telmex populations when trying to interpret the plot. The plot itself gives us a warning of this as the variances are rather large in each case, thus decreasing our confidence that the story it seems to tell can generalize well.
Our next exploration will be getting at one of the initial questions: does age play a role in the adoption and active use of Ecobici?
During our analysis of the distribution of the age variable, we partially answered this question. However, it’s important to assess how age relates to the activity a user shows in the Ecobici program.
Let’s begin with a simple and straightforward scatterplot:
ggplot(aes(x = age, y = rides.count), data = users_sample) +
geom_point() +
default_x_scale(users_sample$age, step = 5)
Let’s try to see how the population is distributed by adding an alpha parameter:
ggplot(aes(x = age, y = rides.count), data = users_sample) +
geom_point(alpha = 1/10, position = position_jitter()) +
default_y_scale(users_sample$rides.count, step = 100) +
default_x_scale(users_sample$age, step = 5)
The bulk of the data seems to be below 300 rides, so in order to get a better view, let’s scale the vertical axis, change the plotting color to something light, and add two-dimensional contour lines to see where most points concentrate:
ggplot(aes(x = age, y = rides.count), data = users) +
geom_point(alpha = 1/10, position = position_jitter(), color = 'mistyrose3') +
stat_density2d(colour = 'black', size = 0.5, linetype = 2) +
scale_y_log10(breaks = c(1, 2, 3, 5, 10, 50, 100, 200, 500, 1000)) +
default_x_scale(users$age, step = 3) +
scale_color_brewer(palette = "Blues")
Now we can see what we discovered in the univariate analysis regarding the large number of users with just a few or zero rides. We can also see how the maximum number of points concentrates at around age 28 with a number of rides that varies between 10 and 100.
It would be nice if we could incorporate information about tenure in this same plot. One way to do it is to divide tenure into a discrete number of intervals and color code the points according to that.
Let’s begin by creating a new variable tenure_in_semesters:
users$tenure_in_semesters <- factor(bucket(users$tenure, 180) / 180)
And we’ll also facet by gender to include even more information in the plot
ggplot(aes(x = age, y = rides.count), data = users) +
geom_point(aes(color = tenure_in_semesters), position = position_jitter()) +
default_x_scale(users$age, step = 3) +
stat_density2d(colour = 'black', size = 0.65, linetype = 2) +
scale_y_log10(breaks = c(1, 2, 3, 5, 10, 20, 30, 50, 100, 200, 300, 500)) +
facet_wrap(~gender) +
scale_color_brewer(palette = "Blues")
Now we have even more information in one plot: we can easily see that most of the users with zero rides also have large tenures, which is interesting but still hard to understand. Was there a massive signup of users in the beginning of the program as part of a large media event, maybe?
A couple of other things are revealed by this plot:
If we really want to understand how active users are, we need to plot not the raw count of rides, but instead something that reveals their daily activity, such as the average number of rides per week that they’ve taken. Let’s do that now:
ggplot(aes(x = age, y = (rides.count / tenure) * 7), data = users) +
geom_point(aes(color = tenure_in_semesters), position = position_jitter()) +
default_x_scale(users$age, step = 3) +
stat_density2d(colour = 'yellow', size = 0.5, linetype = 2) +
scale_y_log10(breaks = c(1, 2, 3, 5, 10, 20, 30)) +
facet_wrap(~gender) +
scale_color_brewer(palette = "Blues")
With this last plot, we can confirm that users who joined in the last year or two tend to be more active on average than those who joined earlier.
We’ve discovered quite a lot of things so far using scatterplots. Let’s now turn our attention to creating plots that use summary statistics on the number of rides per age group rather than the raw counts:
ggplot(aes(x = bucket(age, 3), y = rides.count), data = users) +
geom_line(stat = 'summary', fun.y = median, color = median_color) +
geom_line(stat = 'summary', fun.y = mean, color = mean_color) +
geom_line(stat='summary', fun.y = quantile, probs = 0.75,
linetype = "longdash", color = percentile_color, alpha = 0.75) +
geom_line(stat='summary', fun.y = quantile, probs = 0.25,,
linetype = "longdash", color = percentile_color, alpha = 0.75) +
default_x_scale(users$rides.count, step = 5)
Here there’s a clear trend: young people between the ages of 15 and 25 are the ones who take the most rides, and there’s a steady decline in the number of rides as people get older. There are, of couse, outliers that drag the mean upwards, but we care only about the typical user for the most part.
As we’ve done before, we’ll check what the distributions look like for different segments of our population, but the same caveats mentioned in the last section apply here regarding the remarkable differences in the distributions for Telmex and web registration types:
ggplot(aes(x = bucket(age, 3), y = rides.count), data = users) +
geom_line(stat = 'summary', fun.y = median, color = median_color) +
geom_line(stat = 'summary', fun.y = mean, color = mean_color) +
geom_line(stat='summary', fun.y = quantile, probs = 0.75,
linetype = "longdash", color = percentile_color) +
geom_line(stat='summary', fun.y = quantile, probs = 0.25,
linetype = "longdash", color = percentile_color) +
default_x_scale(users$rides.count, step = 5) +
facet_grid(gender ~ registration.type, scales = "free_y")
One thing worth noticing, though, is the difference in the distributions between men and women in the normal subpopulation. Notice how the trends seem to reverse betwen the ages of 70 and 80, going down in the case men and going up significantly in the case of women. Could this be due to the presence of outliers in a relatively small group? Let’s keep in the mind that the number of people in that age range is just 0.3922594% of the total.
Our analysis of the Ecobici users has helped us answer all of the questions we had in mind before beginning our exploration, but it has also revealed some unexpected findings. In the following sections, we’ll recapitulate the analysis done and present three plots that gave us important insights about the data.
While it was clear from the beginning that men used Ecobici more than women, the extent to which this was true wasn’t so clear. During the analysis, we found out that there was a downward trend in the proportion of women that completed an ever increasing number of rides, up to the point where men were the only ones to complete more a thousand rides.
library(scales)
ggplot(aes(x = rides.count, fill = gender), data = users) +
geom_histogram(aes(y = ..density..), position = "fill") +
scale_x_log10(breaks = c(1, 2, 3, 5, 10, 20, 30, 50, 100, 200, 300, 500, 1000)) +
geom_hline(aes(yintercept = 0.5), linetype = "longdash") +
scale_y_continuous(breaks = seq(0, 1, 0.1), labels = percent_format()) +
ggtitle('Rides Completed in Ecobici by Gender and Type of Registration') +
xlab('Rides completed') +
ylab('Participation') +
facet_wrap(~registration.type)
Also interesting in this plot is the fact that, if we divide the population by the type of signup process they used to enroll in the program, the distributions of each group do change substantially.
We take this result with a grain of salt, however, given that the sizes of the Telmex and web groups are orders of magnitude smaller than that of the “normal” group, so it’s likely that they will have much higher variance and may not faithfully reflect the true underlying distribution (assuming that there is indeed such an out-of-sample distribution which is substantially different from the “normal” distribution)
In this plot there’s also a hint of another phenomenon that we’ll see more clearly in the following section, namely, that there’s a large proportion of people who signup for the program but end up taking just a few rides (or none at all.)
The Ecobici dataset contains a field called status which tells us whether a given user is currently paying their annual fee, but that is obviously not enough to tell whether they actually use the program or not and to what extent.
So, to find this out, we analyzed the number of rides people have taken against their age, the number of years they’ve been enrolled in the program, as well as their gender, in hopes of finding patterns:
users$tenure_in_years <- factor(bucket(users$tenure, 365) / 365)
ggplot(aes(x = age, y = rides.count), data = users) +
geom_point(aes(colour = tenure_in_years, weight = tenure_in_years),
position = position_jitter()) +
default_x_scale(users$age, step = 3) +
stat_density2d(colour = 'black', size = 0.45, linetype = 2) +
scale_y_log10(breaks = c(1, 2, 3, 5, 10, 20, 30, 50, 100, 200, 300, 500)) +
facet_wrap(~gender) +
scale_color_brewer(palette = "Blues") +
xlab('Age (in years)') +
ylab('Number of rides') +
ggtitle('Ecobici Usage by Age, Gender and Time Since Enrollment') +
theme(legend.position="top") +
labs(colour="Number of Years in the Ecobici Program")
This plot reveals several interesting things:
New users tend to have few rides in their history, as expected
The largest age groups in women seem to be confined to a more narrow range than those of them (notice how the two-dimensional density lines are closer together.) In other words, the female population tends to be younger than the male population in general.
A majority of users joined the program in the last one or two years
Most teenagers and other young people have joined the program very recently
Users who joined the program recently appear to be more active than those who joined earlier. We confirmed this during our analysis by plotting not against the raw number of rides for each user, but rather against the average number of rides per week.
The horizontal strips at the bottom of the plot shows that there is a large number of people who signed up for the program, took a few rides and then stopped using it. There are even people who didn’t use the service at all. In this latter case, they appear to be mostly people who signed at the beginning of the program back in 2010 (notice the dark blue strip at the bottom.)
Finally, we’ll present a plot of age versus the median number of years enrolled in the Ecobici program, faceted by gender and type of registration:
users_summary <- users %>%
group_by(age, gender, registration.type) %>%
summarise(tenure.mean = mean(tenure),
tenure.median = median(tenure)) %>%
ungroup()
ggplot(aes(x = age, y = tenure.median / 365), data = users_summary) +
geom_line(color = median_color) +
geom_smooth() +
default_x_scale(users$age, step = 5) +
facet_grid(gender ~ registration.type) +
xlab('Age (in years)') +
ylab('Median Number of Years Enrolled') +
ggtitle('Adoption Trends in Time by Age')
The story that this plot tells us is that the early adopters of Ecobici were seniors. The rest of the population came on board later on, thus having less tenure. Also, the bowl-shaped curve of the regression line tells us that there’s been a wave of young adults (around 30 years old) adopting the program more recently.
As mentioned in a previous section, the differences in the relative sizes of the subpopulations make it hard to believe with confidence that the distributions depicted by the regression lines are indeed a faithful reflection of the underlying distributions.
We started our exploration with some initial questions in mind, but we also knew we might end up exploring other things as we dug deeper into the relationships hidden in the data.
The scatterplot matrix was a very useful tool to give us a very quick overview of what laid ahead, as it gave us some direction of what might be interesting to look at.
Our analysis then began with an attempt to understand each variable in isolation: histograms, augmented with lines showing its mean, median and interquartile range, were of great help here. Boxplots gave use additional information about what was considered an outlier in each distribution. We also made use of faceted views and coloring to visualize single variables for different subgroups (e.g. males vs. females, people who signed up offline vs. on the web.)
Our analysis then proceeded to consider the relationship between pairs of continuous variables, and we realized that sometimes a direct plot of a variable against another doesn’t really provide much insight into the data. That’s when we turned to using summary statistics. We tried using both the mean and the median, but we settled for the median when trying to make conjectures because our dataset contained too much variation and outliers, making the mean a poor choice to understand the typical member of a population.
One of the initial difficulties we ran into during the analysis was related to the fact that the data wasn’t completely tidy. We realized this as we were doing the visualizations and had to step back and clean it up a bit before proceeding. We tried to mitigate this to some degree with some simple cleaning steps, but we were aware that more work would be needed for a completely clean dataset.
Another difficulty we had was interpreting some of the data correctly. Specifically, in the case of the subpopulations defined by the kind of registration, we were initially inclined to believe that the differences we saw in their distributions and statistical summaries might be an indication of a truly different kind of population. But, as progressed through the analysis, we came to realize that we should be very careful to jump to conclusions because those subpopulations were very small in size and therefore very prone to high variance.
We were able to answer all of the initial questions we had in mind, and we were also lucky to make some important discoveries regarding the adoption of the Ecobici program that may be very valuable to improve the program.
One of those discoveries is the unfortunate pattern that a considerable proportion of the people who sign up to the program end up using it just a couple of times, something that happened even more frequently in the case of women.
Another discovery we made was that, despite getting stuck in its growth during its second year of operation, the program made a comeback in its third and fourth year that attracted a lot of young people who are very active in terms of rides per week.
One piece of future work that would be fantastic is to create a heatmap of Mexico City showing the levels of adoption by region (i.e. municipality). This requires, of course, cleaning up the data more thoroughly, and getting a shapefile of all the municipalities that comprise Mexico City. With that geospatial data, we could easily visualize other things too, such as user adoption faceted by gender, by “activity status” (as measured by the number of rides in a year), etc.
Other possibilities for future analysis include merging other datasets (or their summaries) with this one to find even more interesting facts about the users of Ecobici. For instance, the website where we obtained the dataset used in this analysis, also makes available datasets regarding Ecobici stations and all rides taken since its launching in 2010. This could help us answer questions such as: what’s the average time that people of different ages exercise using Ecobici? What are some of the most/least commonly used stations at certain times during the day? Where should we place the next Ecobici station so that it benefits active and devoted users the most?